Chris Pollett > Old Classes >
CS267
( Print View )

Student Corner:
  [Grades Sec1]

  [Submit Sec1]

  [Class Sign Up Sec1]

  [Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Topics/Outcomes]
  [Outcomes Matrix]
  [Grading]
  [HW/Quiz Info]
  [Exam Info]
  [Regrades]
  [Honesty]
  [Additional Policies]
  [Announcements]

HW Assignments:
  [Hw1] [Hw2] [Hw3]
  [Hw4] [Quizzes]

Practice Exams:
  [Mid 1] [Mid 2] [Final]

HW#2 --- last modified February 10 2019 22:00:08..

Solution set.

Due date: Sep 28

Files to be submitted:
Hw2.zip

Purpose:

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 -- Code a basic inverted index capable of performing conjunctive queries.

LO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

LO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification: For the first part of your homework I want you to expand upon the code you wrote for HW1. Now your program should be run from the command-line with a command like:

program_name folder_to_index query rank_algorithm

For C/C++, program name can be just the name of your program. For Java and PHP, the program name can be "java program_name" or "php program_name". Again, you must program in either C/C++, Java, or PHP and your program should work with a vanilla install of an at most one year old version of gcc/g++/Oracle's Java 1.6/Zend's php. The folder_to_index is, as in the last assignment, a folder containing text files. These will have all have the extension .txt and other files in this folder should be ignored. You should have a readme.txt file saying how to compile and run your program. For this homework, your program does not output the inverted index and related statistics to it. Instead, I want you to store these informations into a class which has the four public methods first, last, prev, and next, described in class for the inverted index ADT. Your implementation of prev and next should use galloping search. The command-line argument query above can be either a single word for example "bob" or it can be a string of words for example "bob smith". If it is more that one term, when I run your program I will put the sequence of terms in double-quotes, so that it will come in to your program as a single command-line argument. Finally, rank_algorithm is either the word "cosine" or the word "proximity". As an example, I might run your program with a line like:

php search_program.php test_folder "apple records" cosine

On such an input, your program should return a sequence of lines, each line containing the pair (some_filename, rank). Here some_filename should be a file which contains the word apple and contains the word records (not necessarily adjacent to each other). Here rank should be the cosine rank score for that document with respect to this query. You should output results in descending order of rank. In computing cosine rank, use TF_IDF values for the vector component where the TF function is the example one given in the book. If proximity is used rather than cosine, I want you to output the results according to the proximity ranking algorithm given in chapter 2 of the book. To compute only those documents in which all the terms appear I want you to use only the functions of the ADT you have built. To do this rewrite the exact phrase algorithm given in class to instead compute this kind of conjunctive query. Do this in pseudo-code as well as your own code and put your pseudo code in the file conjunctive.txt which you submit as part of your ZIP file.

As the last part of your assignment, I want you to create a small text collection of 20 files and put them in a folder test_folder as part of what you submit. Label your files 1.txt through 20.txt Make up a list of five queries (at least two involving more than one term). For each of your queries and each of your files make a 1 (yes), 0 (no) judgment about the relevance of that file to that query. Each of your queries should have the key words searched on as well as a description of the words you expect to get back. Put your queries, and descriptions in queries.txt. For instance, you might have a query "apple records", with a description of the query being the record company of the Beatles. So a document about Apple computers which mentions the word records should not be considered a relevant result for this query. Have a text file judgments.txt, with lines like (query_no, doc_no, relevance) which records this information. Next I want you to compute by hand the recall, precision, and MAP scores of your search program above with respect to each query using this information and what it returns. Put your calculations into the file scores.txt

All components of the grade breakdown below will be graded as either 0- did not work/do, 1/2 - partially works/partially achieved, 1 - full credit

Point Breakdown

Program follows departmental coding guidelines for C/C++/Java or Pear coding guidelines for PHP. Program is well commented	1pt
Program compiles as described above and runs from the command-line using syntax above.	1pt
Program has a working implementation of inverted index ADT using galloping search for next and prev.	1pt
conjunctive.txt has correct pseudo-code for an implementation of conjunctive queries.	1pt
Program outputs results in the format described.	1pt
Program outputs only documents which have all of the query terms.	1pt
Program computes cosine rank scores correctly and outputs them in descending order when cosine used.	1pt
Program computes proximity rank scores correctly and outputs them in descending order when proximity used.	1pt
Hw2.zip contained a folder test_folder with 20 files as described above. Hw2.zip contained the files queries.txt and topics.txt and they were as described above.	1pt
Hw2.zip contained the files scores.txt and it contains recall, MAP, precision scores calculated correctly for the program results with respect to those queries and cosine ranking.	1pt
Total	10pts